
Cardio disease analysis

In [69]:
# colab related
!pip install matplotlib --upgrade
!pip install scikit-learn --upgrade
In [70]:
# Importing main libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_auc_score, RocCurveDisplay


plt.rcParams.update({'figure.figsize': (10.0, 10.0)})
plt.rcParams.update({'font.size': 12})
#plt.rcParams.update({'figure.dpi': 300})
In [71]:
data_source = "https://raw.githubusercontent.com/Giovo17/cardio-disease-analysis/main/cardio_train.csv"
df = pd.read_csv(data_source, sep=";", index_col="id")
df = df.rename(columns={"ap_hi": "systolic_bp", "ap_lo": "diastolic_bp",
                        "gluc": "glucose", "alco": "alcool_intake",
                        "active": "physical_activity", "cardio": "cardio_disease"})
In [72]:
df.head()
Out[72]:
age gender height weight systolic_bp diastolic_bp cholesterol glucose smoke alcool_intake physical_activity cardio_disease
id
0 18393 2 168 62.0 110 80 1 1 0 0 1 0
1 20228 1 156 85.0 140 90 3 1 0 0 1 1
2 18857 1 165 64.0 130 70 3 1 0 0 0 1
3 17623 2 169 82.0 150 100 1 1 0 0 1 1
4 17474 1 156 56.0 100 60 1 1 0 0 0 0

Data description reported by authors

There are 3 types of input features:

  1. Objective: factual information;
  2. Examination: results of medical examination;
  3. Subjective: information given by the patient.
Feature | Feature type | Name in dataset | Data type
Age | Objective feature | age | int (days)
Gender | Objective feature | gender | categorical code (1: female, 2: male)
Height | Objective feature | height | int (cm)
Weight | Objective feature | weight | float (kg)
Systolic blood pressure | Examination feature | systolic_bp | int
Diastolic blood pressure | Examination feature | diastolic_bp | int
Cholesterol | Examination feature | cholesterol | 1: normal, 2: above normal, 3: well above normal
Glucose | Examination feature | glucose | 1: normal, 2: above normal, 3: well above normal
Smoking | Subjective feature | smoke | binary
Alcohol intake | Subjective feature | alcool_intake | binary
Physical activity | Subjective feature | physical_activity | binary
Presence or absence of cardiovascular disease | Target variable | cardio_disease | binary

All of the dataset values were collected at the moment of medical examination.

In [73]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                70000 non-null  int64  
 1   gender             70000 non-null  int64  
 2   height             70000 non-null  int64  
 3   weight             70000 non-null  float64
 4   systolic_bp        70000 non-null  int64  
 5   diastolic_bp       70000 non-null  int64  
 6   cholesterol        70000 non-null  int64  
 7   glucose            70000 non-null  int64  
 8   smoke              70000 non-null  int64  
 9   alcool_intake      70000 non-null  int64  
 10  physical_activity  70000 non-null  int64  
 11  cardio_disease     70000 non-null  int64  
dtypes: float64(1), int64(11)
memory usage: 6.9 MB

There are no missing values

In [74]:
df["gender"] = df["gender"].map({1: 0, 2: 1})

Add the BMI feature from height and weight

In [75]:
BMI = df["weight"] / (df["height"] / 100)**2
df.insert(4, "BMI", BMI)

Convert age from days to years

In [76]:
df["age"] = (df["age"]/365).astype(int)
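As a quick check of the conversion above, the following sketch applies the same division and truncation to the five raw ages shown in the first `df.head()` output. It assumes a 365-day year, as the notebook does, so leap days are ignored; the variable names are only illustrative.

```python
# Raw ages in days, copied from the first df.head() output above
raw_age_days = [18393, 20228, 18857, 17623, 17474]

# Same rule as the notebook: divide by 365 and truncate toward zero
age_years = [int(days / 365) for days in raw_age_days]

print(age_years)
```

Because the result is truncated rather than rounded, a patient who is 51.7 years old is recorded as 51.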
In [77]:
df.head()
Out[77]:
age gender height weight BMI systolic_bp diastolic_bp cholesterol glucose smoke alcool_intake physical_activity cardio_disease
id
0 50 1 168 62.0 21.967120 110 80 1 1 0 0 1 0
1 55 0 156 85.0 34.927679 140 90 3 1 0 0 1 1
2 51 0 165 64.0 23.507805 130 70 3 1 0 0 0 1
3 48 1 169 82.0 28.710479 150 100 1 1 0 0 1 1
4 47 0 156 56.0 23.011177 100 60 1 1 0 0 0 0

Search for duplicated rows

In [78]:
print("Duplicate rows: {}".format(df.duplicated().sum()))
df = df.drop_duplicates()
Duplicate rows: 3208
In [79]:
df.shape
Out[79]:
(66792, 13)
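A minimal sketch of what `duplicated()` counts, on a made-up three-row frame (the values are illustrative, not taken from the dataset). Since `id` is the index, it is not part of the comparison, so two different patients with identical feature values count as duplicates.

```python
import pandas as pd

# Toy frame: the rows with id 0 and id 2 share every column value
toy = pd.DataFrame(
    {"age": [50, 55, 50], "height": [168, 156, 168], "cardio_disease": [0, 1, 0]},
    index=pd.Index([0, 1, 2], name="id"),
)

# duplicated() flags every repeat of an earlier row; the index is ignored
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```

Note that truncating the age to whole years, as done earlier in the notebook, makes rows coarser and is therefore likely to increase the number of rows flagged this way.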

After removing the 3208 duplicated rows, 66792 observations remain.

Data exploration and cleaning

In [80]:
def map_values(dataframe, to_numeric=False):
    if to_numeric:
        dataframe["gender"] = dataframe["gender"].map({"female": 0, "male": 1})
        dataframe["cholesterol"] = dataframe["cholesterol"].map({"normal": 1, "above normal": 2, "well above normal": 3})
        dataframe["glucose"] = dataframe["glucose"].map({"normal": 1, "above normal": 2, "well above normal": 3})
        dataframe["smoke"] = dataframe["smoke"].map({"no": 0, "yes": 1})
        dataframe["alcool_intake"] = dataframe["alcool_intake"].map({"no": 0, "yes": 1})
        dataframe["physical_activity"] = dataframe["physical_activity"].map({"inactive": 0, "active": 1})
        dataframe["cardio_disease"] = dataframe["cardio_disease"].map({"healthy": 0, "sick": 1})

    else:
        dataframe["gender"] = dataframe["gender"].map({0: "female", 1: "male"})
        dataframe["cholesterol"] = dataframe["cholesterol"].map({1: "normal", 2: "above normal", 3: "well above normal"})
        dataframe["glucose"] = dataframe["glucose"].map({1: "normal", 2: "above normal", 3: "well above normal"})
        dataframe["smoke"] = dataframe["smoke"].map({0: "no", 1: "yes"})
        dataframe["alcool_intake"] = dataframe["alcool_intake"].map({0: "no", 1: "yes"})
        dataframe["physical_activity"] = dataframe["physical_activity"].map({0: "inactive", 1: "active"})
        dataframe["cardio_disease"] = dataframe["cardio_disease"].map({0: "healthy", 1: "sick"})


    return dataframe
In [81]:
df = map_values(df, to_numeric=False)

Categorical data

In [82]:
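def get_percentages(ax_container):
    # Build "count  (share %)" labels for the bars of one countplot container.
    # The share is computed within the container, i.e. within one hue level.
    total = 0  # named "total" to avoid shadowing the built-in sum()
    for k in ax_container:
        total += k.get_height()

    perc = []
    for k in ax_container:
        lab = str(k.get_height()) + "  (" + str(round(k.get_height() / total * 100, 1)) + " %)"
        perc.append(lab)

    return perc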
In [83]:
ax = sns.countplot(data=df, x="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.set_title("Cardio disease countplot")

plt.show()

The target variable cardio_disease is roughly balanced: about 48.8% of the patients are healthy and 51.2% are sick.
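As a quick numeric check of the class balance, the sketch below recomputes the class shares from the per-class counts that the `groupby("cardio_disease")` summaries in this notebook report (32599 healthy, 34193 sick). The dictionary is typed in by hand for illustration.

```python
# Class counts as reported by the groupby summaries in this notebook
counts = {"healthy": 32599, "sick": 34193}

total = sum(counts.values())

# Share of each class, in percent, rounded to one decimal
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total, shares)
```

With shares this close to 50%, plain accuracy is a reasonable first metric, which is not the case for strongly imbalanced targets.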

In [84]:
ax = sns.countplot(data=df, x="gender", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Gender countplot - cardio hue")
plt.show()

The patient's gender doesn't seem to be noticeably associated with the target variable.

In [85]:
ax = sns.countplot(data=df, x="cholesterol", hue="cardio_disease", order=["normal", "above normal", "well above normal"])
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Cholesterol countplot - cardio hue")
plt.show()

As the plot shows, the cholesterol level is associated with the target variable.

In [86]:
ax = sns.countplot(data=df, x="glucose", hue="cardio_disease", order=["normal", "above normal", "well above normal"])
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Glucose countplot - cardio hue")
plt.show()

The trend for the glucose variable is the same as for cholesterol, but the differences between healthy and sick patients are more subtle.

In [87]:
ax = sns.countplot(data=df, x="smoke", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Smoke countplot - cardio hue")
plt.show()

The smoke feature looks uncorrelated with the health of the patient.

In [88]:
ax = sns.countplot(data=df, x="alcool_intake", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Alcohol intake countplot - cardio hue")
plt.show()
In [89]:
ax = sns.countplot(data=df, x="physical_activity", hue="cardio_disease")
ax.bar_label(ax.containers[0], get_percentages(ax.containers[0]))
ax.bar_label(ax.containers[1], get_percentages(ax.containers[1]))
ax.set_title("Physical activity countplot - cardio hue")
plt.show()

Alcohol intake and physical activity follow the same pattern as the smoke feature.

Numerical data

Age

In [90]:
ax = sns.boxplot(data=df, x="age", y="cardio_disease", orient="h")
ax.set_title("Age boxplot - cardio hue")
plt.show()

df.groupby("cardio_disease")["age"].describe()
Out[90]:
count mean std min 25% 50% 75% max
cardio_disease
healthy 32599.0 51.220191 6.847514 29.0 46.0 52.0 57.0 64.0
sick 34193.0 54.422835 6.380788 39.0 50.0 55.0 60.0 64.0
In [91]:
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="age", bins=20)
ax.set_title("Age histogram")

plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="age", bins=20, hue="cardio_disease")
ax.set_title("Age histogram - cardio hue")

plt.subplots_adjust(top=1)
plt.show()

The patients' ages range from 29 to 64 years, so the dataset contains only adults.

The distribution seems to be bimodal with modes around 55 and 58.

According to the conditional boxplot, this feature is slightly related to the target variable: the box of the patients with cardiovascular disease has a higher minimum, first quartile, median and third quartile than the box of the healthy patients. The conditional histogram confirms this trend: the concentration of sick patients grows as age increases. In both plots, however, the two classes overlap considerably.

Height, weight and BMI

The Body Mass Index (BMI) is defined as $BMI = \frac{w}{h^2}$, where $w$ is the weight in kilograms and $h$ is the height in meters.

A reference table from the Ministero della Salute (Italian Ministry of Health):

Condition | BMI ($kg/m^2$)
Severe thinness | BMI < 16
Underweight | 16 < BMI < 18.49
Normal weight | 18.5 < BMI < 24.99
Overweight | 25 < BMI < 29.99
Obese class 1 | 30 < BMI < 34.99
Obese class 2 | 35 < BMI < 39.99
Obese class 3 | BMI > 40
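The reference table can be turned into a small lookup, sketched below. The function name `bmi_category` is hypothetical, and treating each class as a half-open interval is an assumption made here so that boundary values (which the table leaves ambiguous) fall into exactly one class.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to the categories of the reference table.

    Assumption: each class is a half-open interval [lower, upper), so that
    a boundary value such as 18.495 still falls into exactly one class.
    """
    upper_bounds = [
        (16.0, "Severe thinness"),
        (18.5, "Underweight"),
        (25.0, "Normal weight"),
        (30.0, "Overweight"),
        (35.0, "Obese class 1"),
        (40.0, "Obese class 2"),
    ]
    for upper, label in upper_bounds:
        if bmi < upper:
            return label
    return "Obese class 3"


# BMI of the first patient in df.head(): 62 kg and 168 cm
first_bmi = 62.0 / (168 / 100) ** 2
print(round(first_bmi, 2), bmi_category(first_bmi))
```

The function does not reject physiologically implausible inputs, so extreme values such as the maximum BMI in this dataset would simply be labelled "Obese class 3".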
In [92]:
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="height", bins=50)
ax.set_xlabel("height ($cm$)")
ax.set_title("Height histogram")

plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="weight", bins=50)
ax.set_xlabel("weight ($kg$)")
ax.set_title("Weight histogram")

plt.subplots_adjust(top=1)
plt.show()
In [93]:
plt.subplot(2, 1, 1)
ax = sns.boxplot(data=df, x="height")
ax.set_xlabel("height ($cm$)")
ax.set_title("Height boxplot")

plt.subplot(2, 1, 2)
ax = sns.boxplot(data=df, x="weight")
ax.set_xlabel("weight ($kg$)")
ax.set_title("Weight boxplot")

plt.subplots_adjust(top=1)
plt.show()
In [94]:
pd.DataFrame(df["height"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
Out[94]:
count mean std min 1% 25% 50% 75% 99.9% max
height 66792.0 164.341748 8.333904 55.0 146.0 159.0 165.0 170.0 190.0 250.0
In [95]:
pd.DataFrame(df["weight"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
Out[95]:
count mean std min 1% 25% 50% 75% 99.9% max
weight 66792.0 74.52116 14.580675 10.0 48.0 65.0 72.0 83.0 150.0 200.0

Both height and weight have unimodal distributions, with modes around 165 cm and 65 kg respectively.

The height distribution doesn't look skewed. The weight distribution seems to be slightly positively skewed, as its right tail is a bit longer than the left one.

Both features contain outliers, as can be seen from the boxplots and the corresponding statistics tables.

Checking weight skewness:

In [96]:
print("Mode: {}, median: {}, mean: {}".format(stats.mode(df["weight"], keepdims=True)[0][0], np.median(df["weight"]), round(np.mean(df["weight"]), 2)))
print("Fisher-Pearson coefficient of skewness: {}".format(round(stats.skew(df["weight"]), 2)))
Mode: 70.0, median: 72.0, mean: 74.52
Fisher-Pearson coefficient of skewness: 0.97
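For reference, the coefficient reported by `stats.skew` with its default settings is the biased Fisher-Pearson coefficient $g_1 = m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments. The sketch below recomputes it with NumPy on two made-up samples; the helper name is hypothetical.

```python
import numpy as np


def fisher_pearson_skewness(values):
    """Biased sample skewness g1 = m3 / m2**1.5 (central moments)."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5


symmetric = [60, 65, 70, 75, 80]
right_tailed = [60, 65, 70, 75, 150]  # one heavy value stretches the right tail

print(fisher_pearson_skewness(symmetric))
print(fisher_pearson_skewness(right_tailed))
```

A value near zero indicates a symmetric sample, while a positive value indicates a longer right tail, which is the situation observed for weight (mode < median < mean).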
In [97]:
ax = sns.boxplot(data=df, x="BMI")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI boxplot")
plt.show()

pd.DataFrame(df["BMI"].describe(percentiles=[0.25, 0.5, 0.75, 0.999])).T
Out[97]:
count mean std min 25% 50% 75% 99.9% max
BMI 66792.0 27.682565 6.184422 3.471784 23.875115 26.573129 30.46875 59.623333 298.666667
In [98]:
plt.subplot(2, 1, 1)
ax = sns.boxplot(data=df, x="BMI", y="cardio_disease", orient="h")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI boxplot - cardio hue")

plt.subplot(2, 1, 2)
ax = sns.boxplot(data=df, x="BMI", y="cardio_disease", orient="h")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_xlim(13, 45)

plt.show()

df.groupby("cardio_disease")["BMI"].describe(percentiles=[0.25, 0.5, 0.75, 0.999])
Out[98]:
count mean std min 25% 50% 75% 99.9% max
cardio_disease
healthy 32599.0 26.674952 5.755386 7.022248 23.372576 25.636917 29.060607 60.089236 237.768633
sick 34193.0 28.643205 6.421927 3.471784 24.560326 27.548209 31.615793 59.458581 298.666667
In [99]:
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="BMI", bins=100, kde=True)
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI histogram")

plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="BMI", bins=100, hue="cardio_disease")
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI histogram - cardio hue")

plt.subplots_adjust(top=1)
plt.show()
In [100]:
ax = sns.histplot(data=df, x="BMI", bins=100, hue="cardio_disease")
ax.set_xlim(0, 70)
ax.set_xlabel("BMI ($kg/m^2$)")
ax.set_title("BMI histogram - cardio hue")

plt.show()
In [101]:
print("Mode: {}, median: {}, mean: {}".format(round(stats.mode(df["BMI"], keepdims=True)[0][0], 2), round(np.median(df["BMI"]), 2), round(np.mean(df["BMI"]), 2)))
print("Fisher-Pearson coefficient of skewness: {}".format(round(stats.skew(df["BMI"]), 2)))
Mode: 23.88, median: 26.57, mean: 27.68
Fisher-Pearson coefficient of skewness: 7.69

The BMI feature is unimodal (mode at 23.88) and strongly positively skewed, as shown by the Fisher-Pearson coefficient of 7.69.

There are a lot of outliers in this feature.

Systolic blood pressure and diastolic blood pressure

These features measure the pressure in the arteries when the heart beats (systolic) and in the interval between two heartbeats (diastolic).

A reference table from heart.org:

Blood pressure category | Systolic blood pressure (mm Hg) | and/or | Diastolic blood pressure (mm Hg)
Normal | systolic_bp < 120 | and | diastolic_bp < 80
Elevated | 120 < systolic_bp < 129 | and | diastolic_bp < 80
High blood pressure (Hypertension stage 1) | 130 < systolic_bp < 139 | or | 80 < diastolic_bp < 89
High blood pressure (Hypertension stage 2) | systolic_bp > 140 | or | diastolic_bp > 90
Hypertensive crisis | systolic_bp > 180 | and/or | diastolic_bp > 120
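The table above can be read as a cascade of checks from the most to the least severe category, as in the sketch below. The function name `bp_category` is hypothetical, and two choices are assumptions made here: the most severe matching category wins, and lower bounds are inclusive (so a systolic value of exactly 130 counts as stage 1).

```python
def bp_category(systolic: float, diastolic: float) -> str:
    """Classify a blood pressure reading (mm Hg) following the table above.

    Assumptions: the most severe matching category wins, and the lower
    bound of each range is inclusive.
    """
    if systolic > 180 or diastolic > 120:
        return "Hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "Hypertension stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Hypertension stage 1"
    if systolic >= 120:
        return "Elevated"
    return "Normal"


# (systolic_bp, diastolic_bp) of the five patients shown in df.head()
readings = [(110, 80), (140, 90), (130, 70), (150, 100), (100, 60)]
labels = [bp_category(s, d) for s, d in readings]
print(labels)
```

The sketch performs no plausibility check, so negative or four-digit readings like the ones present in this dataset would need to be cleaned before applying it.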
In [102]:
ax = sns.boxplot(data=df, x="systolic_bp")
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure boxplot")
plt.show()

pd.DataFrame(df["systolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
Out[102]:
count mean std min 1% 25% 50% 75% 99.9% max
systolic_bp 66792.0 129.231585 157.649354 -150.0 90.0 120.0 120.0 140.0 220.0 16020.0
In [103]:
ax = sns.boxplot(data=df, x="systolic_bp", y="cardio_disease", orient="h")
ax.set_xlim(80, 180)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure boxplot - cardio hue")

plt.show()

df.groupby("cardio_disease")["systolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
Out[103]:
count mean std min 1% 25% 50% 75% 99% max
cardio_disease
healthy 32599.0 120.528728 107.320611 -120.0 90.0 110.0 120.0 120.0 160.0 14020.0
sick 34193.0 137.528734 193.460336 -150.0 100.0 120.0 130.0 140.0 180.0 16020.0
In [104]:
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="systolic_bp", bins=1700)
ax.set_xlim(-50, 300)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure histogram")

plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="systolic_bp", bins=1700, hue="cardio_disease")
ax.set_xlim(-50, 300)
ax.set_xlabel("systolic_bp ($mm Hg$)")
ax.set_title("Systolic blood pressure histogram - cardio hue")

plt.subplots_adjust(top=1)
plt.show()
In [105]:
ax = sns.boxplot(data=df, x="diastolic_bp")
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure boxplot")

plt.show()

pd.DataFrame(df["diastolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.999])).T
Out[105]:
count mean std min 1% 25% 50% 75% 99.9% max
diastolic_bp 66792.0 97.446221 192.906434 -70.0 60.0 80.0 80.0 90.0 1110.0 11000.0
In [106]:
ax = sns.boxplot(data=df, x="diastolic_bp", y="cardio_disease", orient="h")
ax.set_xlim(50, 120)
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure boxplot - cardio hue")

plt.show()

df.groupby("cardio_disease")["diastolic_bp"].describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
Out[106]:
count mean std min 1% 25% 50% 75% 99% max
cardio_disease
healthy 32599.0 84.634743 158.248448 0.0 60.0 70.0 80.0 80.0 100.0 9800.0
sick 34193.0 109.660457 220.252704 -70.0 60.0 80.0 80.0 90.0 1000.0 11000.0
In [107]:
plt.subplot(2, 1, 1)
ax = sns.histplot(data=df, x="diastolic_bp", bins=1300)
ax.set_xlim(-50, 250)
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure")

plt.subplot(2, 1, 2)
ax = sns.histplot(data=df, x="diastolic_bp", bins=1300, hue="cardio_disease")
ax.set_xlim(-50, 250)
ax.set_xlabel("diastolic_bp ($mm Hg$)")
ax.set_title("Diastolic blood pressure - cardio hue")

plt.subplots_adjust(top=1)
plt.show()

These features behave similarly. Both contain outliers, as can be seen from the boxplots.

When outliers are ignored, they are unimodal (modes around 120 mmHg and 80 mmHg respectively) and approximately symmetric. They are related to the cardio_disease feature, as can be seen from the boxplots and the histograms, though healthy and sick patients overlap.

In [108]:
sns.pairplot(data=df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cardio_disease"]], hue='cardio_disease')
plt.show()
In [109]:
ax = sns.heatmap(df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]].corr(), annot=True, fmt=".2g", linewidths=.1, center=0)
ax.set_title("Numeric variables correlation matrix")

plt.show()

Analyzing the relationship between systolic_bp and diastolic_bp

In [110]:
# removing systolic_bp and diastolic_bp outliers
df_cleaned = df.copy()
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["systolic_bp"])) < 1.5)]
df_cleaned = df_cleaned[df_cleaned.systolic_bp > 0]
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["diastolic_bp"])) < 1.5)]
df_cleaned = df_cleaned[df_cleaned.diastolic_bp > 0]

print("Original dataset: {}".format(round(np.corrcoef(df["systolic_bp"], df["diastolic_bp"])[0][1], 3)))
print("Dataset without outliers: {}".format(round(np.corrcoef(df_cleaned["systolic_bp"], df_cleaned["diastolic_bp"])[0][1], 3)))

display(df.shape)
display(df_cleaned.shape)
Original dataset: 0.016
Dataset without outliers: 0.649
(66792, 13)
(65777, 13)
In [111]:
ax = sns.jointplot(data=df_cleaned, x="systolic_bp", y="diastolic_bp", hue="cardio_disease")
ax.set_axis_labels(xlabel="systolic_bp ($mm Hg$)", ylabel="diastolic_bp ($mm Hg$)")
#ax.set_title("Systolic vs diastolic blood pressure scatterplot")

plt.show()

As the scatterplot shows, the two blood pressure variables are clearly correlated (r = 0.649 once the outliers are removed), but this correlation is hidden in the original data (r = 0.016) by the presence of outliers.
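The masking effect of a few extreme values on the Pearson coefficient can be reproduced on synthetic data. The sketch below uses made-up, randomly generated pressures (not the dataset), corrupts five of them with a value similar to the maximum seen in `systolic_bp`, and then applies the same |z-score| < 1.5 filter idea used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up pressures: diastolic depends linearly on systolic plus noise
systolic = rng.normal(120, 15, size=1000)
diastolic = 0.6 * systolic + rng.normal(0, 5, size=1000)
r_clean = np.corrcoef(systolic, diastolic)[0, 1]

# Corrupt five systolic readings with a data-entry-like extreme value
systolic_dirty = systolic.copy()
systolic_dirty[:5] = 16020
r_dirty = np.corrcoef(systolic_dirty, diastolic)[0, 1]

# Same filter idea as the notebook: keep rows with |z-score| < 1.5
z = (systolic_dirty - systolic_dirty.mean()) / systolic_dirty.std()
keep = np.abs(z) < 1.5
r_filtered = np.corrcoef(systolic_dirty[keep], diastolic[keep])[0, 1]

print(round(r_clean, 3), round(r_dirty, 3), round(r_filtered, 3))
```

The outliers inflate the standard deviation so much that all regular readings end up with a tiny z-score, which is why a threshold as low as 1.5 removes only the corrupted rows in this toy setting.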

Analyzing the relationship between BMI, weight and height

In [112]:
# removing height and weight outliers
display(df_cleaned.shape)

df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["height"])) < 4)]
df_cleaned = df_cleaned[(np.abs(stats.zscore(df_cleaned["weight"])) < 4)]

display(df_cleaned.shape)
(65777, 13)
(65493, 13)
In [113]:
plt.subplot(2, 2, 1)
ax = sns.scatterplot(data=df, x="weight", y="BMI")
ax.set_xlabel("weight ($kg$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Weight vs BMI scatterplot")

plt.subplot(2, 2, 2)
ax = sns.scatterplot(data=df_cleaned, x="weight", y="BMI")
ax.set_xlabel("weight ($kg$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Weight vs BMI scatterplot - cleaned dataset")

plt.subplot(2, 2, 3)
ax = sns.scatterplot(data=df, x="height", y="BMI")
ax.set_xlabel("height ($cm$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Height vs BMI scatterplot")

plt.subplot(2, 2, 4)
ax = sns.scatterplot(data=df_cleaned, x="height", y="BMI")
ax.set_xlabel("height ($cm$)")
ax.set_ylabel("BMI ($kg/m^2$)")
ax.set_title("Height vs BMI scatterplot - cleaned dataset")

plt.subplots_adjust(top=1)
plt.show()
In [114]:
sns.pairplot(data=df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cardio_disease"]], hue='cardio_disease')
plt.show()
In [115]:
ax = sns.heatmap(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]].corr(), annot=True, fmt=".2g", linewidths=.1, center=0)
ax.set_title("Numeric variables correlation matrix - cleaned dataset")

plt.show()
In [116]:
df = map_values(df, to_numeric=True)
df_cleaned = map_values(df_cleaned, to_numeric=True)

Dimensionality reduction

PCA

In [117]:
df.head()
Out[117]:
age gender height weight BMI systolic_bp diastolic_bp cholesterol glucose smoke alcool_intake physical_activity cardio_disease
id
0 50 1 168 62.0 21.967120 110 80 1 1 0 0 1 0
1 55 0 156 85.0 34.927679 140 90 3 1 0 0 1 1
2 51 0 165 64.0 23.507805 130 70 3 1 0 0 0 1
3 48 1 169 82.0 28.710479 150 100 1 1 0 0 1 1
4 47 0 156 56.0 23.011177 100 60 1 1 0 0 0 0
In [118]:
scaler = StandardScaler()
scaler.fit(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
df_1 = scaler.transform(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])

pca = PCA()
df_1_reduced = pca.fit_transform(df_1)

def biplot(score, coeff, labels=["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())

    plt.scatter(xs * scalex, ys * scaley) # Display data points
    
    # Display arrows and labels
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'r', alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var" + str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    
    # Plot settings
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.title("Biplot - 2 components")
    plt.grid()
In [119]:
PCs = np.arange(pca.n_components_) + 1
# Running total of the explained variance ratio (avoids shadowing the built-in sum)
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

plt.plot(PCs, pca.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.plot(PCs, cumulative_explained_variance, 'o-', linewidth=2, color='red')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')

plt.show()

The elbow of the scree plot occurs at 4 PCs, but I will proceed with 2 and 3 PCs for the sake of data visualization.
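The elbow is judged visually. A complementary rule, which is not the one used in this notebook, is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below illustrates it with hypothetical ratios and an arbitrary 85% threshold; with a fitted model the array would come from `pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical explained-variance ratios for six components (illustrative only)
explained_variance_ratio = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Running total of the variance explained by the first k components
cumulative = np.cumsum(explained_variance_ratio)

# Smallest k whose cumulative share reaches the chosen threshold
threshold = 0.85  # arbitrary value chosen for illustration
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(2), n_components)
```

`np.searchsorted` works here because the cumulative sum is non-decreasing; it returns the first position whose value is at least the threshold.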

In [120]:
biplot(df_1_reduced[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
In [121]:
def threeD_biplot(score, coeff, labels=["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]):
    xs = score[:,0]
    ys = score[:,1]
    zs = score[:,2]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    scalez = 1.0/(zs.max() - zs.min())
    
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(xs * scalex, ys * scaley, zs * scalez)
    '''
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, coeff[i,2] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, coeff[i,2] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    
    
    ax.xlim(-1,1)
    ax.ylim(-1,1)
    ax.zlim(-1,1)
    '''
    
    ax.set_xlabel("PC{}".format(1))
    ax.set_ylabel("PC{}".format(2))
    ax.set_zlabel("PC{}".format(3))
    ax.grid()
    ax.set_title("Biplot - 3 components")


threeD_biplot(df_1_reduced[:,0:3], np.transpose(pca.components_[0:3, :]))
plt.show()

Data preprocessing

In [122]:
# Checking features' variance
print("Plain dataset: ")
for col in df.columns:
    if col != "cardio_disease":
        print("Variance of {}: {}".format(col, np.var(df[col])))

print("\nCleaned dataset: ")
for col in df_cleaned.columns:
    if col != "cardio_disease":
        print("Variance of {}: {}".format(col, np.var(df_cleaned[col])))
Plain dataset: 
Variance of age: 46.289238206435286
Variance of gender: 0.22932452924542104
Variance of height: 69.45291758339454
Variance of weight: 212.59289836759055
Variance of BMI: 38.246500912147596
Variance of systolic_bp: 24852.946817718945
Variance of diastolic_bp: 37212.33532198777
Variance of cholesterol: 0.4762754320027826
Variance of glucose: 0.33879216792157657
Variance of smoke: 0.0836475960946206
Variance of alcool_intake: 0.05312513373169617
Variance of physical_activity: 0.16087461644691523

Cleaned dataset: 
Variance of age: 46.318408138141294
Variance of gender: 0.22901034234891376
Variance of height: 62.82067588312376
Variance of weight: 196.39850624082894
Variance of BMI: 26.280408307490895
Variance of systolic_bp: 323.72448902929665
Variance of diastolic_bp: 99.73946236693824
Variance of cholesterol: 0.4741720050100033
Variance of glucose: 0.3378730537927973
Variance of smoke: 0.0833820167178163
Variance of alcool_intake: 0.05286931497493984
Variance of physical_activity: 0.1611724652999777

Smoke and alcool_intake have a very low variance, so removing these features can be considered.
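For a 0/1 feature with a share $p$ of ones, the variance is $p(1-p)$, so a low variance mainly says that one of the two values is rare. The sketch below recovers that share from the variance printed for `smoke` (assuming smokers are the minority) and then applies a simple variance cut-off to a made-up matrix. The cut-off of 0.1 is arbitrary, and the NumPy filter only mimics the idea behind scikit-learn's `VarianceThreshold`, which is imported above.

```python
import numpy as np

# For a 0/1 feature with a share p of ones, the variance is p * (1 - p)
smoke_variance = 0.0836475960946206  # value printed above for the plain dataset
# Smaller root of p * (1 - p) = v, assuming smokers are the minority class
p_smokers = (1 - np.sqrt(1 - 4 * smoke_variance)) / 2
print(round(float(p_smokers), 3))

# Toy matrix (made-up values): the second column is almost constant
X = np.array([[0, 0, 5.0],
              [1, 0, 3.0],
              [0, 0, 8.0],
              [1, 1, 1.0],
              [0, 0, 6.0],
              [1, 0, 2.0],
              [0, 0, 7.0],
              [1, 0, 4.0],
              [0, 0, 9.0],
              [1, 0, 0.0]])

variances = X.var(axis=0)
threshold = 0.1  # arbitrary cut-off chosen for illustration
selected = variances > threshold

print(variances.round(3), selected)
```

Variances are only comparable between features on the same scale, so a cut-off like this is easier to justify among the binary columns than across binary and continuous ones.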

In [123]:
# Onehot encoding cholesterol and glucose features

encoder = OneHotEncoder()
onehotarray = encoder.fit_transform(df_cleaned[["cholesterol"]]).toarray()
items = [f'{"cholesterol"}_{item}' for item in encoder.categories_[0]]
df_cleaned[items] = onehotarray

onehotarray = encoder.fit_transform(df_cleaned[["glucose"]]).toarray()
items = [f'{"glucose"}_{item}' for item in encoder.categories_[0]]
df_cleaned[items] = onehotarray

df_cleaned = df_cleaned.drop(columns=["cholesterol", "glucose"])


onehotarray = encoder.fit_transform(df[["cholesterol"]]).toarray()
items = [f'{"cholesterol"}_{item}' for item in encoder.categories_[0]]
df[items] = onehotarray

onehotarray = encoder.fit_transform(df[["glucose"]]).toarray()
items = [f'{"glucose"}_{item}' for item in encoder.categories_[0]]
df[items] = onehotarray

df = df.drop(columns=["cholesterol", "glucose"])
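To make the result of the encoding concrete, the sketch below builds the three indicator columns for the cholesterol codes of the five patients shown in `df.head()`. It uses plain NumPy instead of `OneHotEncoder` only to keep the example short; it is an illustration of the output layout, not the notebook's implementation.

```python
import numpy as np

# Cholesterol codes of the first five patients (from df.head() above)
cholesterol = np.array([1, 3, 3, 1, 1])

# Row k of the identity matrix is the indicator vector of category k + 1
one_hot = np.eye(3)[cholesterol - 1]

print(one_hot)
```

Each row contains exactly one 1, so the three columns always sum to one. For linear models this redundancy is sometimes removed by dropping one column per feature, for instance with `OneHotEncoder(drop="first")`.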
In [124]:
scaler_df = StandardScaler()
df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]] = scaler_df.fit_transform(df[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])

scaler_df_cleaned = StandardScaler()
df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]] = scaler_df_cleaned.fit_transform(df_cleaned[["age", "height", "weight", "BMI", "systolic_bp", "diastolic_bp"]])
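Standardization replaces each value $x$ with $(x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and the population standard deviation of the column. The sketch below applies the formula with NumPy to the ages of the five patients shown in `df.head()`; it is a minimal illustration of the transformation performed by `StandardScaler`.

```python
import numpy as np

# Ages in years of the first five patients (from df.head() above)
age = np.array([50, 55, 51, 48, 47], dtype=float)

# Standardization: subtract the mean, divide by the population std (ddof=0)
age_scaled = (age - age.mean()) / age.std()

print(age_scaled.round(3))
```

The numbers differ from the scaled ages in the notebook output because the notebook's scalers are fitted on the full datasets, not on these five rows.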
In [125]:
df_cleaned.head()
Out[125]:
age gender height weight BMI systolic_bp diastolic_bp smoke alcool_intake physical_activity cardio_disease cholesterol_1 cholesterol_2 cholesterol_3 glucose_1 glucose_2 glucose_3
id
0 -0.418592 1 0.452729 -0.873667 -1.079791 -0.922284 -0.141927 0 0 1 0 1.0 0.0 0.0 1.0 0.0 0.0
1 0.316079 0 -1.061285 0.767523 1.448387 0.745092 0.859378 0 0 1 1 0.0 0.0 1.0 1.0 0.0 0.0
2 -0.271658 0 0.074225 -0.730955 -0.779254 0.189300 -1.143232 0 0 0 1 0.0 0.0 1.0 1.0 0.0 0.0
3 -0.712461 1 0.578897 0.553454 0.235616 1.300884 1.860684 0 0 1 1 1.0 0.0 0.0 1.0 0.0 0.0
4 -0.859395 0 -1.061285 -1.301803 -0.876130 -1.478076 -2.144537 0 0 0 0 1.0 0.0 0.0 1.0 0.0 0.0

Predictive analysis

In [126]:
# Divide the dataset into a train set (80%) and a test set (20%)
def tt_split(dataframe):
    x = dataframe.loc[:, dataframe.columns != 'cardio_disease']
    y = dataframe['cardio_disease']

    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)

    return X_train, X_test, y_train, y_test
In [127]:
classifiers = {
    "Decision Tree": (DecisionTreeClassifier(), {"predict_proba": True}, {'criterion': ("gini", "entropy"),
                                                                          'splitter': ("best", "random"),
                                                                          'class_weight': ["balanced"],
                                                                          'random_state': [1] }),

    "Random Forest": (RandomForestClassifier(), {"predict_proba": True},  {'n_estimators': [100],
                                                                           'criterion': ("gini", "entropy"),
                                                                           'class_weight': ["balanced"],
                                                                           'max_features': ("sqrt", "log2"),
                                                                           'random_state': [1] }),

    "XGBClassifier": (XGBClassifier(), {"predict_proba": True}, {'n_estimators': [100],
                                                                 'learning_rate': (0.01, 0.05, 0.10, 0.20, 0.30),
                                                                 'tree_method': ("exact", "approx", "hist"),
                                                                 'random_state': [1] }),

    "Nearest Neighbors": (KNeighborsClassifier(), {"predict_proba": True}, {'n_neighbors': (5, 7, 9), 
                                                                            'weights': ("uniform", "distance"),
                                                                            'algorithm': ("ball_tree", "kd_tree"),
                                                                            'p': (1, 2, 3),
                                                                            'n_jobs': [-1] }),

    "Logistic Regression": (LogisticRegression(), {"predict_proba": True}, {'C': (0.0001, 0.001, 0.01, 0.1, 1, 2, 5, 10),
                                                                            'solver': ('lbfgs', 'sag', 'saga'),
                                                                            'max_iter': [400],
                                                                            'n_jobs': [-1],
                                                                            'random_state': [1] }),

    #"SVC": (SVC(), {"predict_proba": True}, {'kernel': ("linear", "poly", "rbf"),
    #                                         'C': (1, 5, 10),
    #                                         'probability': [True],  # needed for predict_proba; SVC has no 'solver' parameter
    #                                         'random_state': [1] }),

    "LinearSVC": (LinearSVC(), {"predict_proba": False}, {'C': (0.0001, 0.001, 0.01, 0.1, 1, 2, 5, 10),
                                                          'max_iter': [2000], 
                                                          'random_state': [1] }),

    "Kmeans": (KMeans(), {"predict_proba": False}, {'n_clusters': [2],
                                                    'init': ("k-means++", "random"),
                                                    'algorithm': ("lloyd", "elkan"),
                                                    'random_state': [1] }),

    "MLPClassifier": (MLPClassifier(), {"predict_proba": True}, {'hidden_layer_sizes': ((8, 4), (8, 4, 4)), 
                                                                 'activation': ["relu"],
                                                                 'learning_rate': ("constant", "adaptive"),
                                                                 'learning_rate_init': (0.001, 0.005, 0.01, 0.1, 0.15, 0.2),
                                                                 'max_iter': [400],
                                                                 'random_state': [1] })
}
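
One caveat about the `Kmeans` entry: cluster ids returned by `KMeans.predict` are arbitrary, so cluster 0 does not necessarily correspond to class 0. A minimal sketch of one way to align them, assuming a majority vote on the training data is acceptable (the helper `align_clusters` and the sample arrays are hypothetical, not part of the notebook):

```python
import numpy as np


def align_clusters(cluster_ids_fit, y_fit, cluster_ids_new):
    """Map each cluster id to the majority class observed in that cluster."""
    cluster_ids_fit = np.asarray(cluster_ids_fit)
    y_fit = np.asarray(y_fit)
    mapping = {}
    for c in np.unique(cluster_ids_fit):
        labels, counts = np.unique(y_fit[cluster_ids_fit == c], return_counts=True)
        mapping[c] = labels[np.argmax(counts)]
    # Translate the new cluster ids into class labels.
    return np.array([mapping[c] for c in np.asarray(cluster_ids_new)])


# Made-up example: cluster 0 mostly holds class 1, cluster 1 mostly class 0.
train_clusters = [0, 0, 0, 1, 1, 1]
train_labels = [1, 1, 0, 0, 0, 1]
test_clusters = [1, 0, 0, 1]
print(align_clusters(train_clusters, train_labels, test_clusters))
```

Without such a mapping, the reported scores for a clustering model depend on which cluster happens to receive which id.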
In [128]:
result_matrixes = dict()


def classification(classifiers, X_train, y_train, X_test, y_test, n_iter=5):
  result_matrix = pd.DataFrame(columns=["Classifier", "Accuracy", "Accuracy (train)",  "Precision", "Precision (train)", 
                                        "Recall", "Recall (train)", "F1-Score", "F1-Score (train)", "ROC AUC", "ROC AUC (train)"])

  for name, clf in classifiers.items():
    print("Classifier: ", name)

    # Hyperparameters optimization
    if clf[2] is not None:
      # Exhaustive search for small grids, randomized search otherwise
      if len(ParameterGrid(clf[2])) < n_iter:
        classifier = GridSearchCV(clf[0], clf[2], n_jobs=-1)
      else:
        classifier = RandomizedSearchCV(clf[0], clf[2], n_jobs=-1, n_iter=n_iter, random_state=1)
      classifier.fit(X_train, y_train)
      print("Best hyperparameters : {}".format(classifier.best_params_))
    else:
      classifier = clf[0]
      classifier.fit(X_train, y_train)
    
    # Prediction task handled by the best estimator found by the hyperparameter search
    y_pred = classifier.predict(X_test)
    y_pred_train = classifier.predict(X_train)
            
    # Getting predicted probabilities
    if clf[1]["predict_proba"]:
      y_score = classifier.predict_proba(X_test)[:,1]
      #display("Predicted probability: ", y_score)
      y_score_train = classifier.predict_proba(X_train)[:,1]
      #display("Predicted probability: ", y_score_train)

    # Test and training set metrics (macro-averaged)
    pr, rc, fs, sup = metrics.precision_recall_fscore_support(y_test, y_pred, average='macro')
    pr_train, rc_train, fs_train, sup_train = metrics.precision_recall_fscore_support(y_train, y_pred_train, average='macro')
    result_matrix = pd.concat([result_matrix, pd.DataFrame({"Classifier": name,
                                                            "Accuracy": round(metrics.accuracy_score(y_test, y_pred), 4),
                                                            "Accuracy (train)": round(metrics.accuracy_score(y_train, y_pred_train), 4),
                                                            "Precision": round(pr, 4),
                                                            "Precision (train)": round(pr_train, 4),
                                                            "Recall": round(rc, 4),
                                                            "Recall (train)": round(rc_train, 4),
                                                            "F1-Score": round(fs, 4),
                                                            "F1-Score (train)": round(fs_train, 4),
                                                            "ROC AUC": roc_auc_score(y_test, y_score) if clf[1]["predict_proba"] else None,
                                                            "ROC AUC (train)": roc_auc_score(y_train, y_score_train) if clf[1]["predict_proba"] else None }, index=[0])])

    # Confusion matrix for test set
    cf_matrix = confusion_matrix(y_test, y_pred)
    plt.subplot(2, 1, 1)
    cf_plot = sns.heatmap(cf_matrix, annot=True, fmt="d", cmap='Blues')
    cf_plot.set_title("Confusion matrix - test set")
    cf_plot.set_xlabel("Predicted label")
    cf_plot.set_ylabel("True label")

    # Confusion matrix for train set
    plt.subplot(2, 1, 2)
    cf_matrix = confusion_matrix(y_train, y_pred_train)
    cf_plot_train = sns.heatmap(cf_matrix, annot=True, fmt="d", cmap='Blues')
    cf_plot_train.set_title("Confusion matrix - training set")
    cf_plot_train.set_xlabel("Predicted label")
    cf_plot_train.set_ylabel("True label")

    plt.subplots_adjust(top=1)
    plt.show()

    # ROC AUC curve
    if name != "Kmeans":
      RocCurveDisplay.from_estimator(classifier, X_test, y_test)  
      print("ROC curve - test set")
      plt.show()
      RocCurveDisplay.from_estimator(classifier, X_train, y_train)  
      print("ROC curve - training set")
      plt.show()


  result_matrix.set_index("F1-Score", inplace=True)
  result_matrix.sort_values(by="F1-Score", ascending=False, inplace=True)
  
  return result_matrix
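
The search strategy in `classification` depends on how many combinations a grid contains: fewer than `n_iter` triggers an exhaustive grid search, anything else a randomized one. The sketch below reproduces that count in plain Python for two of the grids defined earlier; `grid_size` and `search_kind` are hypothetical helpers written for this illustration.

```python
from math import prod


def grid_size(param_grid):
    """Number of distinct combinations in a dict-style hyperparameter grid."""
    return prod(len(values) for values in param_grid.values())


def search_kind(param_grid, n_iter):
    # Mirrors the rule used in classification(): grid search only if the
    # grid is smaller than the randomized-search budget.
    return "grid" if grid_size(param_grid) < n_iter else "randomized"


dt_grid = {"criterion": ("gini", "entropy"), "splitter": ("best", "random"),
           "class_weight": ["balanced"], "random_state": [1]}
xgb_grid = {"n_estimators": [100], "learning_rate": (0.01, 0.05, 0.10, 0.20, 0.30),
            "tree_method": ("exact", "approx", "hist"), "random_state": [1]}

print(grid_size(dt_grid), search_kind(dt_grid, n_iter=10))    # 4 grid
print(grid_size(xgb_grid), search_kind(xgb_grid, n_iter=10))  # 15 randomized
```

With `n_iter=10`, as used in the cells that follow, the decision tree grid is searched exhaustively while the XGBoost grid is only sampled.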

Plain dataset

In [129]:
X_train, X_test, y_train, y_test = tt_split(df)

result_matrixes["Plain dataset"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Plain dataset"])
Classifier:  Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'random_state': 1, 'splitter': 'best'}
ROC curve - test set
ROC curve - training set
Classifier:  Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'gini', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  XGBClassifier
Best hyperparameters : {'tree_method': 'approx', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.1}
ROC curve - test set
ROC curve - training set
Classifier:  Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier:  Logistic Regression
Best hyperparameters : {'solver': 'lbfgs', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 5}
ROC curve - test set
ROC curve - training set
Classifier:  LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
Best hyperparameters : {'C': 10, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier:  MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.01, 'learning_rate': 'constant', 'hidden_layer_sizes': (8, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7287    XGBClassifier        0.7287    0.7483            0.7297     0.7493             0.7296  0.7489          0.7482            0.796294  0.824923
0.7257    MLPClassifier        0.7257    0.7334            0.7269     0.7345             0.7267  0.7340          0.7333            0.793346  0.800074
0.7193    Logistic Regression  0.7193    0.7229            0.7202     0.7238             0.7201  0.7235          0.7229            0.782172  0.786405
0.7042    LinearSVC            0.7042    0.7093            0.7047     0.7100             0.7048  0.7098          0.7093            None      None
0.6953    Random Forest        0.6955    0.9834            0.6953     0.9833             0.6954  0.9835          0.9834            0.747861  0.99942
0.6571    Nearest Neighbors    0.6571    0.7368            0.6574     0.7370             0.6575  0.7370          0.7368            0.710222  0.811528
0.6257    Decision Tree        0.6259    0.9834            0.6257     0.9836             0.6258  0.9838          0.9834            0.624746  0.999449
0.5669    Kmeans               0.5776    0.5796            0.5943     0.5944             0.5832  0.5834          0.5685            None      None
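
The ROC AUC entries are `None` for LinearSVC because `classification` only scores models flagged with `predict_proba`. `roc_auc_score` also accepts non-thresholded decision values, so one possible extension (an assumption, not what the notebook does) is to fall back to `decision_function`. The synthetic data and the helper `ranking_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem standing in for the notebook's data.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)

svc = LinearSVC(dual=False, random_state=1).fit(X_tr, y_tr)


def ranking_scores(estimator, X):
    """Return scores usable by roc_auc_score for a binary classifier."""
    if hasattr(estimator, "predict_proba"):
        return estimator.predict_proba(X)[:, 1]
    # Signed distance to the separating hyperplane; only the ranking matters.
    return estimator.decision_function(X)


auc_svc = roc_auc_score(y_te, ranking_scores(svc, X_te))
print(round(auc_svc, 4))
```

This would not help for `KMeans`, which exposes neither method in a form that ranks samples by class membership.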

Cleaned dataset

In [130]:
X_train, X_test, y_train, y_test = tt_split(df_cleaned)

result_matrixes["Cleaned dataset"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Cleaned dataset"])
Classifier:  Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'gini', 'random_state': 1, 'splitter': 'best'}
ROC curve - test set
ROC curve - training set
Classifier:  Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier:  Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier:  Logistic Regression
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 0.1}
ROC curve - test set
ROC curve - training set
Classifier:  LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
Best hyperparameters : {'C': 10, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier:  MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7272    MLPClassifier        0.7276    0.7332            0.7307     0.7355             0.7287  0.7337          0.7328            0.79295   0.800408
0.7269    XGBClassifier        0.7272    0.7424            0.7297     0.7443             0.7281  0.7428          0.7421            0.793989  0.817191
0.7194    Logistic Regression  0.7196    0.7242            0.7216     0.7257             0.7205  0.7246          0.7240            0.781189  0.788357
0.7184    LinearSVC            0.7186    0.7229            0.7208     0.7245             0.7195  0.7233          0.7226            None      None
0.7051    Nearest Neighbors    0.7051    0.7592            0.7055     0.7595             0.7055  0.7594          0.7592            0.757223  0.840482
0.6894    Random Forest        0.6894    0.9838            0.6895     0.9838             0.6895  0.9838          0.9838            0.743143  0.999439
0.6422    Kmeans               0.6465    0.6571            0.6578     0.6670             0.6490  0.6584          0.6530            None      None
0.6131    Decision Tree        0.6132    0.9838            0.6131     0.9841             0.6131  0.9840          0.9838            0.612691  0.999474
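
The paired test/train columns make overfitting easy to spot. As a small sketch, the snippet below takes the Accuracy and Accuracy (train) values from the table above and flags models whose gap exceeds a threshold; the 0.10 cut-off is an arbitrary assumption chosen for illustration.

```python
# (test accuracy, train accuracy) copied from the cleaned-dataset results.
accuracy = {
    "MLPClassifier": (0.7276, 0.7332),
    "XGBClassifier": (0.7272, 0.7424),
    "Logistic Regression": (0.7196, 0.7242),
    "LinearSVC": (0.7186, 0.7229),
    "Nearest Neighbors": (0.7051, 0.7592),
    "Random Forest": (0.6894, 0.9838),
    "Kmeans": (0.6465, 0.6571),
    "Decision Tree": (0.6132, 0.9838),
}

THRESHOLD = 0.10  # arbitrary cut-off, an assumption for this sketch

# Generalization gap: how much better the model does on data it has seen.
gaps = {name: round(train - test, 4) for name, (test, train) in accuracy.items()}
flagged = sorted(name for name, gap in gaps.items() if gap > THRESHOLD)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} gap = {gap:.4f}")
print("Likely overfitting:", flagged)
```

Only the unconstrained tree-based models stand out here, which suggests that limiting tree depth or leaf size would be a natural next thing to tune.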

Cleaned dataset without smoking and alcohol features

In [131]:
X_train, X_test, y_train, y_test = tt_split(df_cleaned[["age", "gender", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cholesterol_1", "cholesterol_2", "cholesterol_3", "glucose_1", "glucose_2", "glucose_3", "physical_activity", "cardio_disease"]])

result_matrixes["Cleaned dataset without smoke and alcool features"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Cleaned dataset without smoke and alcool features"])
Classifier:  Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'random_state': 1, 'splitter': 'random'}
ROC curve - test set
ROC curve - training set
Classifier:  Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier:  Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier:  Logistic Regression
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 1}
ROC curve - test set
ROC curve - training set
Classifier:  LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
Best hyperparameters : {'C': 5, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier:  MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7281    MLPClassifier        0.7281    0.7325            0.7295     0.7336             0.7288  0.7328          0.7324            0.791201  0.798018
0.7254    XGBClassifier        0.7256    0.7423            0.7280     0.7440             0.7266  0.7427          0.7420            0.793193  0.816049
0.7191    Logistic Regression  0.7193    0.7243            0.7213     0.7257             0.7202  0.7247          0.7240            0.780494  0.787784
0.7178    LinearSVC            0.7181    0.7230            0.7203     0.7246             0.7190  0.7234          0.7227            None      None
0.7041    Nearest Neighbors    0.7041    0.7600            0.7046     0.7603             0.7045  0.7601          0.7600            0.755588  0.840698
0.6860    Random Forest        0.6861    0.9817            0.6861     0.9817             0.6861  0.9818          0.9817            0.742809  0.999219
0.6422    Kmeans               0.6465    0.6572            0.6579     0.6671             0.6490  0.6584          0.6531            None      None
0.6222    Decision Tree        0.6223    0.9817            0.6223     0.9822             0.6224  0.9819          0.9817            0.621779  0.999332
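
The `ConvergenceWarning` in the LinearSVC output means liblinear used up its `max_iter` budget before reaching its stopping tolerance. Besides raising `max_iter`, the scikit-learn documentation suggests `dual=False` when there are more samples than features, which is the case here. The sketch below, on synthetic data (an assumption), counts the warnings for a deliberately starved dual solver and for the primal solver; `convergence_warnings` is a hypothetical helper.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic data as a stand-in for the notebook's features.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)


def convergence_warnings(estimator, X, y):
    """Fit the estimator and return how many ConvergenceWarnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        estimator.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)


# Dual solver with a one-iteration budget: hits max_iter and warns.
n_dual = convergence_warnings(LinearSVC(dual=True, max_iter=1, random_state=1), X, y)
# Primal solver with the default iteration budget.
n_primal = convergence_warnings(LinearSVC(dual=False, random_state=1), X, y)
print(n_dual, n_primal)
```

Whether the primal solver also removes the warnings on the real dataset would need to be checked by adding `'dual': [False]` to the LinearSVC grid.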

Obese/overweight dataset

In [132]:
df_obese = df_cleaned[["age", "gender", "height", "weight", "BMI", "systolic_bp", "diastolic_bp", "cholesterol_1", "cholesterol_2", "cholesterol_3", "glucose_1", "glucose_2", "glucose_3", "physical_activity", "cardio_disease"]]
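# NOTE: `BMI` is not defined in this cell; it is presumably a Series of unscaled BMI values
# created earlier in the notebook and aligned on `id` (the BMI column shown above is standardized).
df_obese = df_obese.loc[BMI >= 25]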

X_train, X_test, y_train, y_test = tt_split(df_obese)

result_matrixes["Obese/overweight dataset"] = classification(classifiers, X_train, y_train, X_test, y_test, n_iter=10)
display(result_matrixes["Obese/overweight dataset"])
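
The filter in this cell relies on a separate `BMI` object rather than on the `BMI` column, which appears to be standardized. A minimal sketch of how such a mask selects rows, assuming `BMI` is a Series of unscaled values sharing the frame's index (`bmi_raw`, `toy`, and all numbers below are made up for illustration):

```python
import pandas as pd

# Unscaled BMI values, playing the role of the notebook's `BMI` variable.
bmi_raw = pd.Series([22.0, 27.5, 31.2, 24.9], index=[0, 1, 2, 3], name="BMI")

# Frame whose BMI column has already been standardized.
toy = pd.DataFrame(
    {"BMI": [-0.8, 0.3, 1.1, -0.2], "cardio_disease": [0, 1, 1, 0]},
    index=[0, 1, 2, 3],
)

# Boolean mask built on the raw values, applied by index label.
mask = bmi_raw >= 25
toy_obese = toy.loc[mask]
print(toy_obese.index.tolist())  # [1, 2]
```

Applying the threshold to the standardized column instead would compare z-scores against 25 and select almost nothing.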
Classifier:  Decision Tree
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'random_state': 1, 'splitter': 'random'}
ROC curve - test set
ROC curve - training set
Classifier:  Random Forest
Best hyperparameters : {'class_weight': 'balanced', 'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  XGBClassifier
Best hyperparameters : {'tree_method': 'exact', 'random_state': 1, 'n_estimators': 100, 'learning_rate': 0.05}
ROC curve - test set
ROC curve - training set
Classifier:  Nearest Neighbors
Best hyperparameters : {'weights': 'uniform', 'p': 1, 'n_neighbors': 9, 'n_jobs': -1, 'algorithm': 'kd_tree'}
ROC curve - test set
ROC curve - training set
Classifier:  Logistic Regression
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/linear_model/_sag.py:350: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
Best hyperparameters : {'solver': 'sag', 'random_state': 1, 'n_jobs': -1, 'max_iter': 400, 'C': 1}
ROC curve - test set
ROC curve - training set
Classifier:  LinearSVC
/home/giovo17/.local/lib/python3.10/site-packages/sklearn/svm/_base.py:1225: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn(
Best hyperparameters : {'C': 0.001, 'max_iter': 2000, 'random_state': 1}
ROC curve - test set
ROC curve - training set
Classifier:  Kmeans
Best hyperparameters : {'algorithm': 'lloyd', 'init': 'random', 'n_clusters': 2, 'random_state': 1}
Classifier:  MLPClassifier
Best hyperparameters : {'random_state': 1, 'max_iter': 400, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (8, 4, 4), 'activation': 'relu'}
ROC curve - test set
ROC curve - training set
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7192    XGBClassifier        0.7233    0.7422            0.7192     0.7378             0.7191  0.7378          0.7378            0.788441  0.813985
0.7185    MLPClassifier        0.7218    0.7292            0.7181     0.7248             0.7191  0.7259          0.7253            0.785613  0.786888
0.7174    Logistic Regression  0.7222    0.7202            0.7181     0.7154             0.7168  0.7141          0.7146            0.776475  0.775423
0.7151    LinearSVC            0.7202    0.7191            0.7162     0.7142             0.7143  0.7122          0.7130            None      None
0.6942    Nearest Neighbors    0.6996    0.7555            0.6950     0.7516             0.6937  0.7496          0.7505            0.748889  0.83151
0.6819    Random Forest        0.6890    0.9873            0.6842     0.9862             0.6808  0.9882          0.9871            0.742657  0.999535
0.6155    Decision Tree        0.6217    0.9873            0.6158     0.9859             0.6154  0.9887          0.9871            0.614553  0.999673
0.3279    Kmeans               0.3436    0.3447            0.3265     0.3257             0.3301  0.3293          0.3272            None      None
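
The ConvergenceWarning messages in the output above mean that the sag and liblinear solvers reached max_iter before the coefficients stabilised. A common cause is features on very different scales, and standardising them inside a pipeline often helps. The following is only a sketch under assumptions: the data are synthetic (`make_classification`, with two columns inflated by hand), and the helper `count_convergence_warnings` and the names `raw_svc` and `scaled_svc` are hypothetical, not part of the notebook.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: NOT the notebook's dataset).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=1,
)
# Inflate two columns to mimic raw features measured in much larger units.
X = X * np.array([1, 1000, 1, 1, 500, 1, 1, 1, 1, 1])


def count_convergence_warnings(model):
    """Fit the model on (X, y) and count the ConvergenceWarnings it raises."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)


# Same estimator twice: once on raw features, once behind a StandardScaler.
raw_svc = LinearSVC(C=1.0, max_iter=2000, random_state=1)
scaled_svc = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=2000, random_state=1),
)

n_raw = count_convergence_warnings(raw_svc)
n_scaled = count_convergence_warnings(scaled_svc)
print("ConvergenceWarnings without scaling:", n_raw)
print("ConvergenceWarnings with scaling:   ", n_scaled)
```

If scaling turns out to help on the real data, a pipeline like `scaled_svc` could be passed to the hyperparameter search in place of the bare estimator, so that the scaler is fitted on each training fold only.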

Clustering preprocessing on cleaned dataset

In [65]:
# Cluster validation: elbow method on WSS and average silhouette for k = 2..49
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Features only: drop the target column before clustering
X_cluster = df_cleaned.loc[:, df_cleaned.columns != "cardio_disease"]

total_WSS = []
silhouette_avgs = []
K = range(2, 50)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=1)
    cluster_labels = kmeans.fit_predict(X_cluster)

    # Elbow method: inertia_ is the total within-cluster sum of squares
    total_WSS.append(kmeans.inertia_)

    # Silhouette method: mean silhouette coefficient over all samples
    silhouette_avgs.append(silhouette_score(X_cluster, cluster_labels))


plt.subplot(2, 2, 1)
plt.plot(K, total_WSS, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('Total Within Sum of Squares')
plt.title('The elbow method on WSS')

plt.subplot(2, 2, 2)
plt.plot(K, silhouette_avgs, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette average')
plt.title('The silhouette method')

plt.tight_layout()
plt.show()

It doesn't make sense to proceed with this analysis, because the dataset does not show a clear cluster structure.
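
To make the notion of "no clear cluster structure" more concrete: the average silhouette lies between -1 and 1, and KMeans will partition any dataset, so even structureless data yields a positive score. It is therefore more informative to compare against a reference than to read the value in isolation. The sketch below does that on two synthetic datasets (both are assumptions, not the notebook's data), and the helper `best_silhouette` is hypothetical. The `sample_size` argument of `silhouette_score` is used to keep the pairwise-distance cost down, which may also be worth considering on the full dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Two synthetic stand-ins (assumptions, not the notebook's data):
# well-separated blobs versus structureless uniform noise.
X_blobs, _ = make_blobs(
    n_samples=1500, centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.5, random_state=1,
)
X_noise = rng.uniform(0.0, 1.0, size=(1500, 2))


def best_silhouette(X, k_values=range(2, 7)):
    """Return (best_k, best_score) over the candidate numbers of clusters."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
        # Score on a random subsample of 1000 points to limit the cost.
        scores[k] = silhouette_score(X, labels, sample_size=1000, random_state=1)
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


k_blobs, s_blobs = best_silhouette(X_blobs)
k_noise, s_noise = best_silhouette(X_noise)
print(f"blobs: best k = {k_blobs}, silhouette = {s_blobs:.2f}")
print(f"noise: best k = {k_noise}, silhouette = {s_noise:.2f}")
```

A silhouette curve for the real data that stays close to the noise reference for every k would support the conclusion above.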

Final results

In [133]:
# Show the metrics table computed for each dataset variant
for dataset_name, result_matrix in result_matrixes.items():
    display(dataset_name, result_matrix)
'Plain dataset'
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7287    XGBClassifier        0.7287    0.7483            0.7297     0.7493             0.7296  0.7489          0.7482            0.796294  0.824923
0.7257    MLPClassifier        0.7257    0.7334            0.7269     0.7345             0.7267  0.7340          0.7333            0.793346  0.800074
0.7193    Logistic Regression  0.7193    0.7229            0.7202     0.7238             0.7201  0.7235          0.7229            0.782172  0.786405
0.7042    LinearSVC            0.7042    0.7093            0.7047     0.7100             0.7048  0.7098          0.7093            None      None
0.6953    Random Forest        0.6955    0.9834            0.6953     0.9833             0.6954  0.9835          0.9834            0.747861  0.99942
0.6571    Nearest Neighbors    0.6571    0.7368            0.6574     0.7370             0.6575  0.7370          0.7368            0.710222  0.811528
0.6257    Decision Tree        0.6259    0.9834            0.6257     0.9836             0.6258  0.9838          0.9834            0.624746  0.999449
0.5669    Kmeans               0.5776    0.5796            0.5943     0.5944             0.5832  0.5834          0.5685            None      None
'Cleaned dataset'
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7272    MLPClassifier        0.7276    0.7332            0.7307     0.7355             0.7287  0.7337          0.7328            0.79295   0.800408
0.7269    XGBClassifier        0.7272    0.7424            0.7297     0.7443             0.7281  0.7428          0.7421            0.793989  0.817191
0.7194    Logistic Regression  0.7196    0.7242            0.7216     0.7257             0.7205  0.7246          0.7240            0.781189  0.788357
0.7184    LinearSVC            0.7186    0.7229            0.7208     0.7245             0.7195  0.7233          0.7226            None      None
0.7051    Nearest Neighbors    0.7051    0.7592            0.7055     0.7595             0.7055  0.7594          0.7592            0.757223  0.840482
0.6894    Random Forest        0.6894    0.9838            0.6895     0.9838             0.6895  0.9838          0.9838            0.743143  0.999439
0.6422    Kmeans               0.6465    0.6571            0.6578     0.6670             0.6490  0.6584          0.6530            None      None
0.6131    Decision Tree        0.6132    0.9838            0.6131     0.9841             0.6131  0.9840          0.9838            0.612691  0.999474
'Cleaned dataset without smoke and alcool features'
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7281    MLPClassifier        0.7281    0.7325            0.7295     0.7336             0.7288  0.7328          0.7324            0.791201  0.798018
0.7254    XGBClassifier        0.7256    0.7423            0.7280     0.7440             0.7266  0.7427          0.7420            0.793193  0.816049
0.7191    Logistic Regression  0.7193    0.7243            0.7213     0.7257             0.7202  0.7247          0.7240            0.780494  0.787784
0.7178    LinearSVC            0.7181    0.7230            0.7203     0.7246             0.7190  0.7234          0.7227            None      None
0.7041    Nearest Neighbors    0.7041    0.7600            0.7046     0.7603             0.7045  0.7601          0.7600            0.755588  0.840698
0.6860    Random Forest        0.6861    0.9817            0.6861     0.9817             0.6861  0.9818          0.9817            0.742809  0.999219
0.6422    Kmeans               0.6465    0.6572            0.6579     0.6671             0.6490  0.6584          0.6531            None      None
0.6222    Decision Tree        0.6223    0.9817            0.6223     0.9822             0.6224  0.9819          0.9817            0.621779  0.999332
'Obese/overweight dataset'
F1-Score  Classifier           Accuracy  Accuracy (train)  Precision  Precision (train)  Recall  Recall (train)  F1-Score (train)  ROC AUC   ROC AUC (train)
0.7192    XGBClassifier        0.7233    0.7422            0.7192     0.7378             0.7191  0.7378          0.7378            0.788441  0.813985
0.7185    MLPClassifier        0.7218    0.7292            0.7181     0.7248             0.7191  0.7259          0.7253            0.785613  0.786888
0.7174    Logistic Regression  0.7222    0.7202            0.7181     0.7154             0.7168  0.7141          0.7146            0.776475  0.775423
0.7151    LinearSVC            0.7202    0.7191            0.7162     0.7142             0.7143  0.7122          0.7130            None      None
0.6942    Nearest Neighbors    0.6996    0.7555            0.6950     0.7516             0.6937  0.7496          0.7505            0.748889  0.83151
0.6819    Random Forest        0.6890    0.9873            0.6842     0.9862             0.6808  0.9882          0.9871            0.742657  0.999535
0.6155    Decision Tree        0.6217    0.9873            0.6158     0.9859             0.6154  0.9887          0.9871            0.614553  0.999673
0.3279    Kmeans               0.3436    0.3447            0.3265     0.3257             0.3301  0.3293          0.3272            None      None
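
A note on the Kmeans rows: the cluster ids returned by KMeans are arbitrary, so scoring them directly against the target only works if cluster 0 happens to coincide with class 0. An accuracy well below 0.5 on a binary target, such as the 0.3436 in the Obese/overweight table, is consistent with the ids being matched to the opposite class, although the notebook output alone cannot confirm this. One possible remedy, sketched here on synthetic data (an assumption) with a hypothetical helper `cluster_to_class_map`, is to map each cluster to its majority class on the training set before scoring.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption: stand-in for the notebook's features/target).
X, y = make_blobs(
    n_samples=1000, centers=[[0, 0], [4, 4]], cluster_std=1.0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_train)


def cluster_to_class_map(cluster_ids, y_true):
    """Map each cluster id to the most frequent true class inside that cluster."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        classes, counts = np.unique(y_true[cluster_ids == cluster], return_counts=True)
        mapping[cluster] = classes[np.argmax(counts)]
    return mapping


# Learn the mapping on the training set only, then apply it to test predictions.
mapping = cluster_to_class_map(kmeans.labels_, y_train)
raw_pred = kmeans.predict(X_test)
aligned_pred = np.array([mapping[c] for c in raw_pred])

raw_acc = accuracy_score(y_test, raw_pred)
aligned_acc = accuracy_score(y_test, aligned_pred)
print(f"accuracy with raw cluster ids: {raw_acc:.3f}")
print(f"accuracy after alignment:      {aligned_acc:.3f}")
```

Because the mapping is learned from the training labels, the aligned score is comparable with the supervised rows in the tables, while the clustering itself stays unsupervised.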